Abstract Inspired by RNA-editing as occurs in transcriptional processes in the living cell, we
introduce an abstract notion of string adjustment, called guided rewriting. This formalism allows
simultaneously inserting and deleting elements. We prove that guided rewriting preserves regularity:
for every regular language its closure under guided rewriting is regular too. This contrasts with an
earlier abstraction of RNA-editing separating insertion and deletion, for which it was proved that
regularity is not preserved. The particular automaton construction here relies on an auxiliary notion
of slice sequence, which makes it possible to sweep from left to right through a completed rewrite sequence.



1 Introduction 

We study an elementary biologically inspired formalism of string replacement referred to as guided 
rewriting. Given a fixed and finite set G of strings, also called guides, a rewriting step amounts to adapting 
a substring towards a guide. We consider two versions of guided rewriting: guided insertion/deletion, 
which is close to an editing mechanism as encountered in the living cell, and general guided rewriting 
based on an adjustment relation, which is mathematically more amenable. For guided insertion/deletion 
the guide and the part of the string that is rewritten do not need to be of the same length. They are 
required to be equal up to occurrences of a distinguished dummy symbol. For general guided rewriting 
the correspondence of the guide and the substring that is rewritten is element-wise. The guide and
substring are equivalent symbol-by-symbol according to a fixed equivalence relation called adjustment.

In both cases, for a finite set of guides $G$, only a finite set of strings can be obtained by repeatedly
rewriting a given string. Starting from a language $L$, we may consider the extension $L_{i/d}$ of the
language with all the rewrites obtained by guided insertion/deletion and the extension $L_G$ of the
language obtained by adding all the adjustment-based guided rewrites. We address the question whether
regularity of $L$ implies regularity of $L_{i/d}$ and of $L_G$. The results of the paper state that in the
case of guided insertion/deletion regularity is preserved if the strings of dummy symbols involved are
bounded, and that guided rewriting based on adjustment always preserves regularity.

The motivation for this work stems from transcriptional biology. RNA can be seen as strings over the 
alphabet {C, G, A, U}. Replication of the encoded information is one of the most essential mechanisms in 
life: strands of RNA are faithfully copied by the well-known processes of RNA-transcription. However,
typical for eukaryotic cells, the synthesis of RNA does not yield an exact copy of part of the DNA, 
but a modification obtained by post-processing. The class of the underlying adjustment mechanisms is 
collectively called RNA-editing. 

Abstracting away from biological details, the computational power of insertion-deletion systems for 
RNA-editing is studied in lPT4l : an insertion step is the replacement of a string uv by the string uav taken 
from a particular finite set of triples u, a, v. Similarly, a deletion step replaces uav by uv for another finite 
set of triples u,a,v. In [ 10 ] the restriction is considered where u and v are both empty. The approach 
claims full computational power, that is, they generate all recursively enumerable languages. 

In the RNA-editing mechanisms occurring in nature, however, only very limited instances of these 
formats apply. Often only the symbol U is inserted and deleted, instead of arbitrary strings a, see 
e.g. Q]. Therefore, following ifTSTl , we investigate guided insertion/deletion focusing on the special role 
of the distinguished symbol 0, the counterpart of the RNA-base U. However, in order to prove that 
under this scheme regularity is preserved we extend our investigations to guided rewriting based on an
abstract adjustment relation. In fact, we prove the theorem for guided insertion/deletion by appealing
to the result for guided rewriting based on adjustment.

The proof of the latter result relies on reorganizing sequences of guided rewrites into sequences of 
so-called slices. The point is that, since guides may overlap, each guided rewrite step adds a 'layer' 
on top of the previous string. In this sense guided rewriting is vertically oriented. E.g., Figure 2 in
Section 5 shows six rewrite steps of the string ebcfa yielding the string fbcfb involving eight layers in
total. However, in reasoning about recognition by a finite automaton a horizontal orientation is more
natural. One would like to sweep from left to right, so to speak. Again referring to Figure 2, five slices
can be distinguished, viz. a slice for each symbol of the string ebcfa. The technical machinery developed
in this paper allows for a transition between the two orientations. 

The organization of this paper is as follows. The biological background of RNA-editing is provided in
Section 2. Section 3 presents the theorem on preservation of regularity for guided insertion/deletion. The
notion of guided rewriting based on an adjustment relation is introduced in Section 4 and a corresponding
theorem on preservation of regularity is presented. To pave the way for its proof, Section 5 introduces
the notions of a rewrite sequence and of a slice sequence and establishes their relationship. Rewrite
sequences record the subsequent guided rewrites that take place, whereas slice sequences represent the
cumulative effect of all rewrites at a particular position of the string being adjusted. In Section 6 we
provide, given a finite automaton accepting a language $L$, the construction of an automaton for the
extended language $L_G$ with respect to a set of guides $G$. Section 7 wraps up with related work and
concluding remarks.

Acknowledgment We acknowledge fruitful feedback from Peter van der Gulik and detailed comment 

2 Biological motivation

This section provides a description of RNA editing from a biological perspective. In this paper we 
focus on the insertion and deletion of uracil in messenger RNA (mRNA) and provide abstractions of 
the underlying mechanism in the sequel. However, in the living cell there are different kinds of RNA 
editing that vary in the type of RNA that is edited and the type of editing operations. Uracil is represented 
by the letter U. The three other types of nucleotides for RNA, viz. adenine, guanine and cytosine are 
represented by the letters A, G and C, respectively. 

U-insertion/deletion editing is widely studied in the mitochondrial genes of kinetoplastid protozoa
Ifl3l . Kinetoplastids are single cell organisms that include parasites like Trypanosoma brucei and
Crithidia fasciculata, that can cause serious diseases in humans and/or animals. Modifications of
kinetoplastid mRNA are usually made within the coding regions. These are the parts that are translated into
proteins, which are the building blocks of the cells. This way coded information of the original gene can 
be altered and therefore expressed, i.e. translated into proteins, in a varying number of ways, depending 
on the environment in the cell. This provides additional flexibility as well as potential specialization of 
different parts of the organisms for particular functions. 

Here we describe a somewhat simplified version of the mechanism for the insertion and deletion of U. 
More details can be found, for instance, in |[T3l [Tl l3l fl2l . For simplicity we assume that only identical 
letters match with one another. In reality, the matching is based on complementarity, usually assuming 
the so-called Crick-Watson pairs: A matches with U and G matches with C.

In general, a single step in the editing of mRNA involves two strands of RNA, a strand of messenger 
RNA and a strand of guide RNA, the latter typically referred to as the guide. To explain the mechanism 
for the insertion of uracil, let us consider an example. See Figure 1. Assume that we start off with
an mRNA fragment $u = N_1N_2N_3N_4N_5$ and the guide $g = N_2N_3UUUN_4$, where $N_i$ can be an arbitrary
nucleotide A, G or C, but not U. Obviously, there is some match between $u$ and $g$ involving the letters
$N_2$, $N_3$, and $N_4$, which is partially 'spoiled' by the $UUU$ sequence. By pairing of letters we have
that $g$ attaches to $u$; the matching substrings $N_2N_3$ and $N_4$ serve as anchors.

Figure 1: Various stages of guided U-insertion

By chemical reactions involving special enzymes $u$ is split open between $N_3$ and $N_4$. The gap between
the anchors is then filled by the enzyme mechanism using the guide as a template. For each letter U in the
guide a U is also added in the gap. As a result the mRNA string $u$ is transformed into $N_1N_2N_3UUUN_4N_5$.
In general, one can have more than two anchors (involving only non-U letters) in which the guide and the
mRNA strand match. In that case mRNA is opened between each pair of anchors and all gaps between 
these anchors are filled with U such that the number of Us in the guide is matched. 

A similar biochemical mechanism implements the deletion of Us from a strand of mRNA. We
illustrate the deletion process with the following example. Let us assume that we have the mRNA strand
$u = N_1N_2N_3UUN_4N_5$ and the guide $g = N_2N_3N_4$. As in the insertion case, $g$ initiates the editing by
attaching itself to $u$ at the matching positions $N_2N_3$ and $N_4$. Only now the enzymatic complex removes
the mismatching $UU$ substring between $N_3$ and $N_4$ to ensure a perfect match between the substring
and the guide. As a result the edited string $N_1N_2N_3N_4N_5$ is obtained. In general, we can have several
anchoring positions on the same string. In that case, all Us between each two matching positions are
removed from the mRNA.

A guide can also induce both insertions and deletions of U simultaneously. For instance the guide
$N_2N_3UUUN_4$ can induce editing in parallel of the string $N_1UN_2UN_3UN_4UN_5UN_6$, which results in the
string $N_1UN_2N_3UUUN_4UN_5UN_6$, where the U between $N_2$ and $N_3$ has been deleted and two U's between
$N_3$ and $N_4$ have been inserted. This is done by the same biochemical mechanisms that are involved in
separate insertions and deletions. Analogously to the above, we can have multiple insertions and deletions
induced by the same guide on the original pre-edited sequence.

The net effect of all three cases considered above is that a strand $u = xyz$, such that $y$ equals $g$
up to occurrences of U, is modified by the insertion and deletion mechanism and becomes a string
$v = xgz$. It is noteworthy that the rewriting system that we describe in the sequel also applies to another
case with the same effect. For example, consider a guide $g = N_2N_3UUUN_4$ and a pre-edited mRNA
$u = N_1N_2N_3UUN_4N_5N_6$. Now, to obtain the match of the guide $g$ and a substring $y$ of $u$, a U is inserted
in $u$, resulting in the string $v = N_1N_2N_3UUUN_4N_5N_6$. If the U subsequence in $y$ was longer though, as
in the case of $u' = N_1N_2N_3UUUN_4N_5N_6$ and $g' = N_2N_3UUN_4$, then the extra U in $u'$ is
removed, resulting in $v' = N_1N_2N_3UUN_4N_5N_6$.

To summarize, the mRNA editing mechanism underlying U-insertion/deletion can be interpreted as
symbolic manipulations of strings. In the sequel the symbol U will be denoted by 0; it obviously plays a
special role. The crucial point is that in a single step some substring $y$ is replaced by a guide $g$ for which
$y$ and $g$ coincide except for the symbol 0.

3 Guided insertion/deletion

Inspired by the biological scheme of editing of mRNA as discussed in the previous section, we study 
the more abstract notion of guided insertion and deletion and guided rewriting based on an adjustment 
relation in the remainder of this paper. In this section we address guided insertion and deletion, turning
to guided rewriting in Section 4.

More precisely, fix an alphabet $\Sigma_0$ and distinguish $0 \notin \Sigma_0$. Put $\Sigma = \Sigma_0 \cup \{0\}$. Choose a finite set
$G \subseteq \Sigma^*$, with elements $g$ also referred to as guides. Reflecting the biological mechanism, we assume that
each $g \in G$ has at least two letters and that the first and last letter of each $g \in G$ are not equal to 0. Hence,
$G \subseteq \Sigma_0 \cdot \Sigma^* \cdot \Sigma_0$. Now a guided insertion/deletion step $\Rightarrow_{i/d}$ with respect to $G$ is given by

$$u \Rightarrow_{i/d} v \quad\text{iff}\quad u = xyz \,\wedge\, \pi(y) = \pi(g) \,\wedge\, v = xgz \quad\text{for some } x, z \in \Sigma^* \text{ and } g \in G,$$

where $y \in \Sigma_0 \cdot \Sigma^* \cdot \Sigma_0$, and $\pi(y)$ and $\pi(g)$ are obtained from $y$ and $g$, respectively, by removing their 0s.
Thus, $\pi : \Sigma^* \to \Sigma_0^*$ is the homomorphism such that $\pi(\varepsilon) = \varepsilon$, $\pi(0) = \varepsilon$ and $\pi(a) = a$ for $a \in \Sigma_0$. So,
intuitively, $g$ is anchored on the substring $y$ of $u$ and sequences of 0s are adjusted as prescribed by the
guide $g$, in effect replacing the substring $y$ by the guide $g$ while maintaining the prefix $x$ and suffix $z$.

As a simple example of a single guided insertion/deletion step, for $G = \{g\}$ with $g = bcb000ab0c$
and $u = a00bc00babcc00a00b$ we have $u \Rightarrow_{i/d} v$ for $v = a00bcb000ab0cc00a00b$. Here it holds that
$u = a00 \cdot bc00babc \cdot c00a00b$, $\pi(bc00babc) = bcbabc = \pi(bcb000ab0c)$ and $v = a00 \cdot bcb000ab0c \cdot c00a00b$.
Note that for the string $v$, being the result of a rewrite with guide $g$ itself with only one possible anchoring,
only trivial steps can be taken further. So, the operation of guided insertion/deletion with the same
guide $g$ at the same position in a string is idempotent. However, anchorings may overlap. Consider the
set of guides $G = \{ aa0a, a0aa \}$, for example. Then the string $aaa$ yields an infinite rewrite sequence

$$aaa \Rightarrow_{i/d} aa0a \Rightarrow_{i/d} a0aa \Rightarrow_{i/d} aa0a \Rightarrow_{i/d} a0aa \Rightarrow_{i/d} \cdots$$

Still, from $aaa$ only finitely many different rewrites can be obtained by insertion/deletion steps guided
by this $G$, viz. $\{ aaa, aa0a, a0aa \}$.
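
The two examples above can be replayed mechanically. The following is a minimal Python sketch, not part of the paper's development: the function names `strip_dummies`, `indel_successors` and `indel_closure` are illustrative choices, and strings over the alphabet are assumed to be ordinary Python strings with the character `0` as dummy symbol. The closure loop terminates only because the set of rewrites of a single string is finite for guides that start and end with a non-dummy letter.

```python
from collections import deque

DUMMY = "0"  # assumed encoding of the distinguished dummy symbol


def strip_dummies(s):
    """The homomorphism pi: erase every dummy symbol."""
    return s.replace(DUMMY, "")


def indel_successors(u, guides):
    """All strings v reachable from u by one guided insertion/deletion step."""
    result = set()
    for start in range(len(u)):
        if u[start] == DUMMY:
            continue  # the substring y must start with a non-dummy letter
        for end in range(start + 2, len(u) + 1):
            if u[end - 1] == DUMMY:
                continue  # the substring y must end with a non-dummy letter
            y = u[start:end]
            for g in guides:
                if strip_dummies(y) == strip_dummies(g):
                    result.add(u[:start] + g + u[end:])
    return result


def indel_closure(u, guides):
    """All strings reachable from u by zero or more steps (breadth-first)."""
    seen = {u}
    queue = deque([u])
    while queue:
        w = queue.popleft()
        for v in indel_successors(w, guides):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen
```

For instance, `indel_closure("aaa", {"aa0a", "a0aa"})` enumerates the rewrites of the string aaa discussed above, even though the rewrite sequence itself can be continued forever.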

The restrictions put on $G$ exclude arbitrary deletions (possible if $\varepsilon$ were allowed as a guide) and
infinite pumping (possible if guides need not be delimited by symbols from $\Sigma_0$). As an illustration of the
latter case, starting from the string $abc$ and 'guide' $0ab$, the infinite sequence
$abc \Rightarrow_{i/d} 0abc \Rightarrow_{i/d} 00abc \Rightarrow_{i/d} 000abc \cdots$ would be obtained. The restriction on the substring $y$
prevents changes outside the scope of the guide $g$ and forbids $a0b000c \Rightarrow_{i/d} ab0c$ by way of the guide $ab$.

As a first observation we show that the set $L^{u}_{i/d} = \{\, v \in \Sigma^* \mid u \Rightarrow^*_{i/d} v \,\}$, for any finite set of guides $G$
and any string $u$, is finite. Write $u = a_0 0^{i_1} a_1 \cdots a_{n-1} 0^{i_n} a_n$ where $a_k \in \Sigma_0$, $i_k \ge 0$, for some $n \ge 0$. In effect,
a guided insertion/deletion step only modifies the substrings $0^{i_k}$ or leaves them as is. Therefore, after one
or more guided insertion/deletion steps the substrings $0^{i_k}$ are strings taken from the set

$$Z^{u}_{i/d} = \{\, 0^{i_k} \mid 1 \le k \le n \,\} \cup \{\, 0^{\ell} \mid x a 0^{\ell} b z \in G,\ a, b \in \Sigma_0,\ \ell \ge 0 \,\}.$$

Thus, if $u \Rightarrow^*_{i/d} v$ then $v \in \mathcal{L}^{u}_{i/d}$, where $\mathcal{L}^{u}_{i/d} = \{\, a_0 z_1 a_1 \cdots a_{n-1} z_n a_n \mid z_k \in Z^{u}_{i/d},\ 1 \le k \le n \,\}$, i.e. $L^{u}_{i/d} \subseteq \mathcal{L}^{u}_{i/d}$.
Since the set $G$ is finite, it follows that $Z^{u}_{i/d}$ is finite, that $\mathcal{L}^{u}_{i/d}$ is finite and that $L^{u}_{i/d}$ is finite as well.

More generally, given a set of guides $G$, we define the extension by insertion/deletion $L_{i/d}$ of a
language $L$ over $\Sigma$ by putting $L_{i/d} = \{\, v \in \Sigma^* \mid \exists u \in L : u \Rightarrow^*_{i/d} v \,\}$. Related to the biological setting of
Section 2, $L$ are the strands of messenger RNA, $G$ are strands of guide RNA. Next, we consider the
question whether regularity of the language $L$ is inherited by the induced language $L_{i/d}$. Note that, despite
the finiteness of the insertion/deletion scheme for a single string, it is not obvious that such would hold.

For example, consider the language corresponding to the regular expression $(ab)^*$ together with
the operation sort which maps a string $w$ over the alphabet $\{a,b\}$ to the string $a^n b^m$ where $n = \#_a(w)$,
$m = \#_b(w)$. Thus $sort(w)$ is a sorted version of $w$ with the a's preceding the b's. Note that for $w \in (ab)^*$
there is only one string $sort(w)$, as sorting is a deterministic, hence finitary operation. However, although
$\mathcal{L}((ab)^*)$, the language associated with the regular expression, is regular, the language

$$sort((ab)^*) = \{\, sort(w) \mid w \in (ab)^* \,\} = \{\, a^n b^n \mid n \ge 0 \,\}$$

is not regular. Also, if we define the rewrite operation $ba \to_R ab$, then $\{\, v \in \{a,b\}^* \mid u \to^*_R v \,\}$ contains
only shuffles of the string $u$, i.e. strings over $\{a,b\}$ having the same number of a's and b's as $u$, and these
are lexicographically no larger than $u$. Thus, the set $\{\, v \in \{a,b\}^* \mid u \to^*_R v \,\}$ is finite for each string $u$.
However, for $L = \mathcal{L}((ab)^*)$, the language $L_R = \{\, v \in \{a,b\}^* \mid \exists u \in L : u \to^*_R v \,\}$ cannot be regular:
intersection with the language of $a^*b^*$ does not yield a regular language. More specifically,
$L_R \cap \mathcal{L}(a^*b^*) = \{\, a^n b^n \mid n \ge 0 \,\}$. We conclude that the
question of $L_{i/d}$ being regular, given regularity of the language $L$, is not straightforward.
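
The counterexample can be checked for small instances. The sketch below is only an illustration under the assumption that strings over {a,b} are Python strings; the names `swap_successors`, `swap_closure` and `sorted_members` are hypothetical. It computes all strings reachable from $(ab)^n$ by rewriting ba to ab and keeps those of the shape $a^*b^*$.

```python
import re


def swap_successors(u):
    """All strings obtained from u by one application of the rule ba -> ab."""
    return {u[:i] + "ab" + u[i + 2:] for i in range(len(u) - 1) if u[i:i + 2] == "ba"}


def swap_closure(u):
    """All strings reachable from u by zero or more applications of ba -> ab."""
    seen = {u}
    stack = [u]
    while stack:
        w = stack.pop()
        for v in swap_successors(w):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def sorted_members(n):
    """Strings of the shape a*b* reachable from (ab)^n."""
    return {v for v in swap_closure("ab" * n) if re.fullmatch(r"a*b*", v)}
```

For each small n tried, `sorted_members(n)` contains exactly the string $a^n b^n$, in line with the intersection argument above. A finite check of this kind does not prove non-regularity, of course; it only illustrates the sets involved.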

With the machinery of rewrite sequences and slice sequences developed in the sequel of the paper,
we will be able to prove the following for guided insertion/deletion. 

Theorem 1. In the setting above, if $L$ is a regular language and for some number $k$ it holds that no
string of $L$ or $G$ contains $k$ (or more) consecutive 0's, then the language $L_{i/d}$ is regular too.

We will prove Theorem 1 by applying a more general result on guided rewriting, viz. Theorem 3
formulated in the next section and ultimately proven in Section 6. Since in the notion of guided rewriting
developed in the sequel symbols are only replaced by single symbols, so that lengths of strings are
always preserved, a transformation is required to be able to apply Theorem 3.

Before doing so we relate our results to those of fl31l . There a relation similar to $\Rightarrow_{i/d}$ was introduced,
with the only difference that in a single step 0's are either deleted or inserted, but not both simultaneously.
One of the conclusions of lPT5l is that in that setting regularity is not preserved, so the opposite of the
main result in the present setting.

4 Guided rewriting 

The idea of guided rewriting is that symbols are replaced by equivalent symbols with respect to some
adjustment relation $\sim$. The one-one correspondence of the symbols of the string $u$ and its guided rewrite $v$,
enjoyed by this notion of reduction, will turn out to be technically convenient in the sequel.

Let $\Sigma$ be a finite alphabet and $\sim$ an equivalence relation on $\Sigma$, called the adjustment relation. If $a \sim b$
we say that $a$ can be adjusted to $b$. For a string $u \in \Sigma^*$ we write $\#u$ for its length, use $u[i]$ to denote its $i$-th
element, $i = 1, \ldots, \#u$, and let $u[p,q]$ stand for the substring $u[p]\,u[p+1] \cdots u[q]$. The relation $\sim$ is lifted
to $\Sigma^*$ by putting

$$u \sim v \quad\text{iff}\quad \#u = \#v \,\wedge\, \forall i = 1, \ldots, \#u : u[i] \sim v[i].$$

Next we define a notion of guided rewriting that involves an adjustment relation.

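Definition 2. We fix a finite subset $G \subseteq \Sigma^*$ called the set of guides.

(a) For $u, v \in \Sigma^*$, $g \in G$, $p \ge 0$, we define $u \Rightarrow_{g,p} v$, stating that $v$ is the rewrite of $u$ with guide $g$ at
position $p$, by

$$u \Rightarrow_{g,p} v \quad\text{iff}\quad \exists x, y, z \in \Sigma^* : u = xyz \,\wedge\, \#x = p \,\wedge\, y \sim g \,\wedge\, v = xgz.$$

(b) We write $u \Rightarrow v$ if $u \Rightarrow_{g,p} v$ for some $g \in G$ and $p \ge 0$. We use $\Rightarrow^*$ to denote the reflexive transitive
closure of $\Rightarrow$. A sequence $u_1 \Rightarrow u_2 \Rightarrow \cdots \Rightarrow u_n$ is called a reduction.

(c) For a language $L$ over $\Sigma$ and a set of guides $G$ we write

$$L_G = \{\, v \in \Sigma^* \mid \exists u \in L : u \Rightarrow^* v \,\}.$$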

So, a $\Rightarrow$-step adjusts a substring to a guide in $G$ element-wise, and $L_G$ consists of all strings that can
be obtained from a string from $L$ by any number of such adjustments. For example, if $\Sigma = \{a,b,c\}$,
$G = \{bb\}$ and $a \sim b$ but not $a \sim c$, then by a $\Rightarrow$-step two consecutive symbols not equal to $c$ are replaced
by two consecutive b's. In particular, $aaacaa \Rightarrow_{bb,1} abbcaa$ and $abbcaa \Rightarrow_{bb,0} bbbcaa$. We have

$$\{aaacaa\}_G = \{\, aaacaa,\ bbacaa,\ abbcaa,\ aaacbb,\ bbbcaa,\ abbcbb,\ bbacbb,\ bbbcbb \,\}.$$
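
The closure in this example is small enough to enumerate. The following Python sketch is an illustration only: it assumes that the adjustment relation is given as a dictionary `cls` mapping each symbol to an identifier of its equivalence class, and the names `adjustable`, `rewrite_successors` and `rewrite_closure` are not taken from the paper.

```python
def adjustable(y, g, cls):
    """y ~ g lifted to strings: equal length and symbol-wise equivalent."""
    return len(y) == len(g) and all(cls[a] == cls[b] for a, b in zip(y, g))


def rewrite_successors(u, guides, cls):
    """All v such that u rewrites to v with some guide g at some position p."""
    out = set()
    for g in guides:
        for p in range(len(u) - len(g) + 1):
            if adjustable(u[p:p + len(g)], g, cls):
                out.add(u[:p] + g + u[p + len(g):])
    return out


def rewrite_closure(u, guides, cls):
    """All strings reachable from u; finite because lengths are preserved."""
    seen = {u}
    stack = [u]
    while stack:
        w = stack.pop()
        for v in rewrite_successors(w, guides, cls):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen
```

With `cls = {"a": 0, "b": 0, "c": 1}` and the single guide `bb`, calling `rewrite_closure("aaacaa", {"bb"}, cls)` enumerates the eight strings listed above. Since a step never changes the length of a string, the search space is bounded by the strings of that length, so the loop always terminates.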
Next, we state the main result of this paper regarding guided rewriting as given by Definition 2.

Theorem 3. Given an equivalence relation $\sim$ on $\Sigma$, let $G$ be a finite set of guides. Suppose $L$ is a regular
language. Then $L_G$ is regular too.

Before going to the proof, we first show that both finiteness of $G$ and the requirement of $\sim$ being an
equivalence relation are essential. Below, for a regular expression $r$ we write $\mathcal{L}(r)$ for its corresponding
language.

To see that finiteness of $G$ is essential for Theorem 3 to hold, let $G = \{\, c a^k c b^k c \mid k \ge 0 \,\}$ and
$L = \mathcal{L}(ca^*ca^*c)$. Let $\sim$ satisfy $a \sim b$ but not $a \sim c$. Then all elements of $L$ on which an adjustment
is applicable are of the shape $c a^k c a^k c$, where the result of the adjustment is $c a^k c b^k c$, which cannot be
changed by any further adjustment. So

$$L_G \cap \mathcal{L}(ca^*cb^*c) = \{\, c a^k c b^k c \mid k \ge 0 \,\}$$

is not regular. Since regularity is closed under intersection we conclude that $L_G$ cannot be regular itself.

Also the equivalence properties of $\sim$ are essential for Theorem 3. For $G = \{ ab \}$ and ${\sim} = \{ (a,b), (b,a) \}$
the only possible $\Rightarrow$-steps replace the pattern $ba$ by $ab$. Note that here $\sim$ is neither reflexive nor
transitive. Since $ba$ may be replaced by $ab$, bubble sort on a's and b's can be mimicked by $\Rightarrow^*$, while on
the other hand $\Rightarrow^*$ preserves both the number of a's and the number of b's. Hence

$$\mathcal{L}((ab)^*)_G \cap \mathcal{L}(a^*b^*) = \{\, a^k b^k \mid k \ge 0 \,\}$$

which proves that $\mathcal{L}((ab)^*)_G$ is not regular, again since regularity is closed under intersection.

5 Rewrite sequences and slice sequences 

Fix an alphabet $\Sigma$, an adjustment relation $\sim$, and a set of guides $G$.

Definition 4. A sequence $\rho = (g_k, p_k)_{k=1}^{r}$ of guide-position pairs is called a guided rewrite sequence
for a string $u \in \Sigma^*$ if it holds that (i) $g_k \in G$, (ii) $0 \le p_k \le \#u - \#g_k$, and (iii) $u[p_k+1, p_k+\#g_k] \sim g_k$, for
all $k = 1, \ldots, r$.

A guide-position pair $(g,p)$ indicates a redex for a guided rewrite with $g$ of the string $u$. The position $p$
is relative to $u$. For the rewrite to fit we must have $p + \#g \le \#u$. The first $p$ symbols of $u$, i.e. the
substring $u[1,p]$, are not affected by the rewrite, and neither are the last $\#u - (p + \#g)$ symbols of $u$, i.e. the
substring $u[p+\#g+1, \#u]$.

The sequence $\rho$ induces a sequence of strings $(u_k)_{k=0}^{r}$ by putting $u_0 = u$ and $u_k$ such that
$u_{k-1} \Rightarrow_{g_k,p_k} u_k$ for $k = 1, \ldots, r$. To conclude that $u_{k-1} \Rightarrow_{g_k,p_k} u_k$ is indeed a proper guided rewrite step,
in particular that we have $u_{k-1}[p_k+1, p_k+\#g_k] \sim g_k$, we use the assumption $u[p_k+1, p_k+\#g_k] \sim g_k$ and the
fact that if $u \Rightarrow_{g,p} v$ then $u[p+1, p+\#g] \sim v[p+1, p+\#g]$. So we obtain $u \Rightarrow^* u_r$ by construction. The
string $u_r$ is referred to as the yield of $\rho$ for $u$, notation $yield(\rho)$. Conversely, every specific reduction
from $u$ to $v$ gives rise to a corresponding guided rewrite sequence for $u$.
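
As a small illustration of these notions, the sketch below checks the three conditions of a guided rewrite sequence against the original string and then replays the sequence to obtain its yield. It is a sketch under assumptions: positions are the 0-based prefix lengths $p$ of the text, the adjustment relation is passed as a dictionary `cls` from symbols to class identifiers, and the names `is_rewrite_sequence` and `yield_of_rewrites` are illustrative.

```python
def is_rewrite_sequence(u, rho, guides, cls):
    """Conditions (i)-(iii) for rho = [(g, p), ...], checked against the original u."""
    for g, p in rho:
        if g not in guides:                      # (i) g is a guide
            return False
        if not 0 <= p <= len(u) - len(g):        # (ii) the guide fits at position p
            return False
        window = u[p:p + len(g)]                 # u[p+1, p+#g] in the 1-based notation
        if any(cls[a] != cls[b] for a, b in zip(window, g)):  # (iii) window ~ g
            return False
    return True


def yield_of_rewrites(u, rho):
    """Replay the rewrite steps of rho on u and return the final string."""
    for g, p in rho:
        u = u[:p] + g + u[p + len(g):]
    return u
```

For the two steps $aaacaa \Rightarrow_{bb,1} abbcaa \Rightarrow_{bb,0} bbbcaa$ of Section 4, the sequence `[("bb", 1), ("bb", 0)]` passes the check and `yield_of_rewrites` returns `bbbcaa`. Checking condition (iii) against the original string suffices because each step replaces symbols only by equivalent ones.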

Definition 5. Let $a \in \Sigma$. A sequence $sl = (g_i, q_i)_{i \in I}$ of guide-offset pairs, for $I \subseteq \mathbb{N}$ a finite index set, is
called a slice for $a$ and $G$ if it holds that (i) $g_i \in G$, (ii) $1 \le q_i \le \#g_i$, and (iii) $a \sim g_i[q_i]$, for all $i \in I$. The
slice $sl$ is called a slice for a string $u \in \Sigma^*$ at position $n$, $1 \le n \le \#u$, if it is a slice for $u[n]$.

Note that in a guide-offset pair (g,q) of a slice sequence, the offset q is relative to the guide g. Since 
we require 1 ^ q ^ #g for such a pair, the symbol g [q] is well-defined. We will reserve the use of q for 
offsets, indices within a guide, and the use of p for positions after which a rewrite may take place, i.e. 
for lengths of proper prefixes of a given string. 

The goal of the notion of slice is to summarize the effect of a number of guided rewrites local to a
specific position within a string. The symbol generated by the last rewrite that affected the position, i.e.
the particular symbol of the last element of the slice, is part of the overall outcome of the total
rewrite. This symbol is called the yield of the slice. More precisely, if $I \ne \emptyset$, the yield of a slice $sl$ for a
symbol $a$ is defined as $yield(sl) = g_{i_{\max}}[q_{i_{\max}}]$ where $i_{\max} = \max(I)$. In case $I = \emptyset$, we put $yield(sl) = a$.
Occasionally we write $a \sim sl$, as for a slice $sl$ for a symbol $a$ it always holds that $a \sim yield(sl)$.

A slice $sl$ is said to be repetition-free if $g_i = g_j \wedge q_i = q_j$ implies $i = j$. If we have $I = \emptyset$, the slice $sl$
is called the empty slice.

Next we consider sequences of slices, and investigate the relationship between slices on two consecutive 
positions in a guided rewrite sequence. 

Definition 6. A sequence $\sigma = (sl_n)_{n=1}^{\#u}$ is called a slice sequence for a string $u$ if the following holds:

• $sl_n$ is a slice for $u$ at position $n$, for $n = 1, \ldots, \#u$;

• for $n = 1, \ldots, \#u - 1$, putting $sl_n = (g_i, q_i)_{i \in I}$ and $sl_{n+1} = (g'_j, q'_j)_{j \in J}$, there exists a monotone partial
injection $\gamma_n : I \to J$ such that, for all $i \in I$ and $j \in J$,

- $i \notin dom(\gamma_n) \iff q_i = \#g_i$

- $\gamma_n(i) = j \implies g_i = g'_j \wedge q_i + 1 = q'_j$

- $j \notin rng(\gamma_n) \iff q'_j = 1$

• the slices $sl_1$ and $sl_{\#u}$, say $sl_1 = (g_i, q_i)_{i \in I}$ and $sl_{\#u} = (g'_j, q'_j)_{j \in J}$, satisfy $q_i = 1$, for all $i \in I$, and
$q'_j = \#g'_j$, for all $j \in J$, respectively.

For the slices $sl_n$ and $sl_{n+1}$ the mapping $\gamma_n : I \to J$ is called the cut for $sl_n$ and $sl_{n+1}$. It witnesses that
$sl_n$ and $sl_{n+1}$ match in the sense that a rewrite may end at position $n$, may continue for its next offset at
position $n+1$, and may start at position $n+1$. Since a cut $\gamma$ is an order-preserving bijection from $dom(\gamma)$
to $rng(\gamma)$, and $dom(\gamma)$ and $rng(\gamma)$ are finite, it follows that for two slices $sl$, $sl'$ the cut $sl \to sl'$ is unique.
We write $sl \rightsquigarrow sl'$. A slice $sl = (g_i, q_i)_{i \in I}$ is called a start slice if $q_i = 1$ for all $i \in I$. Similarly, $sl$ is
called an end slice if $q_i = \#g_i$ for all $i \in I$. A start slice is generally associated with the first position of
the string that is rewritten, an end slice with the last position. Note that a start slice as well as an end slice
are allowed to be empty. The yield of the slice sequence $\sigma$ is the sequence of the yields of its slices, i.e.
we define $yield(\sigma) = yield(sl_1) \cdots yield(sl_{\#u})$.
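
Because the cut is unique, whether two adjacent slices match can be decided by a direct comparison. The sketch below makes this concrete under some representation choices that are assumptions of the illustration, not of the paper: a slice is a Python list of `(guide, offset)` pairs ordered by increasing index, offsets are 1-based as in the text, and `cls` maps symbols to equivalence class identifiers. The names `has_cut`, `slice_yield` and `is_slice_sequence` are hypothetical.

```python
def has_cut(sl, sl_next):
    """A cut exists iff the unfinished pairs of sl, advanced by one offset,
    are exactly the non-starting pairs of sl_next, in the same order."""
    continuing = [(g, q + 1) for g, q in sl if q < len(g)]
    continued = [(g, q) for g, q in sl_next if q > 1]
    return continuing == continued


def slice_yield(a, sl):
    """Symbol produced by the topmost pair of the slice, or a if the slice is empty."""
    if not sl:
        return a
    g, q = sl[-1]
    return g[q - 1]


def is_slice_sequence(u, slices, guides, cls):
    """Check the requirements of a slice sequence for the string u."""
    if len(slices) != len(u):
        return False
    for a, sl in zip(u, slices):
        for g, q in sl:
            if g not in guides or not 1 <= q <= len(g) or cls[a] != cls[g[q - 1]]:
                return False
    if not u:
        return True
    if any(q != 1 for _, q in slices[0]):          # first slice is a start slice
        return False
    if any(q != len(g) for g, q in slices[-1]):    # last slice is an end slice
        return False
    return all(has_cut(s, t) for s, t in zip(slices, slices[1:]))
```

The list comparison in `has_cut` encodes monotonicity: the pairs that continue must appear in the same relative order on both sides, which is what an order-preserving bijection between the domain and the range of the cut requires.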

Example 7. Let $\sim$ be the adjustment relation with equivalence classes $\{a,b\}$, $\{c,d\}$, $\{e,f\}$ and let the
set of guides $G$ be given by $G = \{g_1, g_2, g_3\}$ where $g_1 = fb$, $g_2 = ace$ and $g_3 = d$. For the string
$u = ebcfa$ we consider the guided rewrite sequence $\rho = ( (g_3,2), (g_1,0), (g_2,1), (g_1,0), (g_1,3), (g_1,3) )$.
The associated reduction looks like

$$ebcfa \Rightarrow_{g_3,2} ebdfa \Rightarrow_{g_1,0} fbdfa \Rightarrow_{g_2,1} facea \Rightarrow_{g_1,0} fbcea \Rightarrow_{g_1,3} fbcfb \Rightarrow_{g_1,3} fbcfb \qquad (1)$$

Recording what happens at all of the five positions of the string $u$ yields, for this example, the slice
sequence $\sigma = (sl_n)_{n=1}^{5}$ given in the table at the left-hand side of Figure 2, where the slice sequence is
visualized too.

For the choice of $I_1, \ldots, I_5$, the monotone partial injection $\gamma_n$, $n = 1, \ldots, 4$, maps every number to itself.
It is easily checked that all requirements of a slice sequence hold. The ovals covering guide-offset pairs
reflect the cuts as mappings between two adjacent slices. However, they also comprise, in this situation
derived from a guided rewrite sequence, complete guides. Note that $sl_1$ is a start slice and $sl_5$ is an end slice.

We have for the slice sequence $\sigma = (sl_n)_{n=1}^{5}$ that $yield(\sigma) = yield(sl_1) \cdots yield(sl_5) = fbcfb$. Indeed,
this coincides with the yield of the guided rewrite sequence $\rho$ of (1).

Figure 2: An example slice sequence

The rest of this section is devoted to proving that the above holds in general: Given a string and a set of 
guides, for every guided rewrite sequence there exists a slice sequence and for every slice sequence there 
exists a guided rewrite sequence. Moreover, the yield of the guided rewrite sequence and slice sequence 
are the same. 

Theorem 8. Let $\rho = (g_k, p_k)_{k=1}^{r}$ be a guided rewrite sequence for a string $u$. Then there exists a slice
sequence $\sigma = (sl_n)_{n=1}^{\#u}$ for $u$ such that $yield(\sigma) = yield(\rho)$.

Proof sketch. Induction on $r$. If $\rho$ is the empty rewrite sequence, we take for $\sigma$ the slice sequence of
$\#u$ empty slices. Suppose $\rho$ is non-empty. Let $(u_k)_{k=0}^{r}$ be the sequence of strings induced by $\rho$. By
induction hypothesis there exists a slice sequence $\sigma'$ for the first $r-1$ steps of $\rho$. Suppose $u_{r-1} \Rightarrow_{g_r,p_r} u_r$.
The slice sequence $\sigma$ is obtained by extending the slices of $\sigma'$ from position $p_r+1$ to $p_r+\#g_r$ with the
pairs $(g_r, n - p_r)$, for $n = p_r+1, \ldots, p_r+\#g_r$. Then,

$$\begin{aligned}
yield(\sigma) &= yield(\sigma'[1, p_r]) \cdot g_r[1, \#g_r] \cdot yield(\sigma'[p_r+\#g_r+1, \#u]) \\
&= u_{r-1}[1, p_r] \cdot g_r \cdot u_{r-1}[p_r+\#g_r+1, \#u_{r-1}] = u_r = yield(\rho).
\end{aligned}$$

Verification of $\sigma$ being a slice sequence for $u$ requires transitivity of $\sim$.

□
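
The construction in the proof sketch can be written down directly. The following Python sketch rests on assumptions made for illustration: each slice is a list of `(guide, offset)` pairs, the index of a pair is taken to be the number of the rewrite step that contributed it (so appending in step order keeps the list sorted by index), and the names `slices_from_rewrites`, `yield_of_slices` and `yield_of_rewrites` are hypothetical.

```python
def slices_from_rewrites(u, rho):
    """Build one slice per position of u from rho = [(g, p), ...].
    The step (g, p) contributes the pair (g, n - p) to positions n = p+1, ..., p+#g."""
    slices = [[] for _ in u]
    for g, p in rho:
        for offset in range(1, len(g) + 1):
            slices[p + offset - 1].append((g, offset))  # list index p+offset-1 is position p+offset
    return slices


def yield_of_slices(u, slices):
    """Concatenate the yields of the slices; an empty slice yields the original symbol."""
    out = []
    for a, sl in zip(u, slices):
        if sl:
            g, q = sl[-1]          # the pair with the largest index, i.e. the last step
            out.append(g[q - 1])
        else:
            out.append(a)
    return "".join(out)


def yield_of_rewrites(u, rho):
    """Replay the rewrite steps of rho on u and return the final string."""
    for g, p in rho:
        u = u[:p] + g + u[p + len(g):]
    return u
```

Applied to the string ebcfa and the rewrite sequence of Example 7, both `yield_of_slices` and `yield_of_rewrites` return `fbcfb`. The sketch does not re-check that `rho` is a valid guided rewrite sequence; it assumes the conditions of Definition 4 hold.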



In order to show the reverse of Theorem 8 we proceed in a number of stages. First we need to relate
individual guide-offset pairs in neighboring slices. For this purpose we introduce the ordering $\preceq$ on
so-called chunks.

Definition 9. Let $\sigma = (sl_n)_{n=1}^{\#u}$ be a slice sequence for $u$. Assume we have $sl_n = (g_{n,i}, q_{n,i})_{i \in I_n}$, for
$n = 1, \ldots, \#u$. Let $\gamma_n : I_n \to I_{n+1}$ be the cut for $sl_n$ and $sl_{n+1}$, $1 \le n < \#u$. Let
$\mathcal{X} = \{\, (g_{n,i}, q_{n,i}, i, n) \mid 1 \le n \le \#u,\ i \in I_n \,\}$ be the set of chunks of $\sigma$ and define the ordering $\preceq$ on $\mathcal{X}$
by putting $(g,q,i,n) \preceq (g',q',i',n')$ iff

• either $n' \ge n$ and there exist indexes $\ell_0, h_0, \ldots, \ell_{n'-n}, h_{n'-n}$ such that

- $\ell_k, h_k \in I_{n+k}$ and $\ell_k \le h_k$, for $0 \le k \le n'-n$

- $h_k \in dom(\gamma_{n+k})$ and $\gamma_{n+k}(h_k) = \ell_{k+1}$, for $0 \le k < n'-n$

- $\ell_0 = i$ and $h_{n'-n} = i'$

• or $n' \le n$ and there exist indexes $\ell_0, h_0, \ldots, \ell_{n-n'}, h_{n-n'}$ such that

- $\ell_k, h_k \in I_{n'+k}$ and $\ell_k \le h_k$, for $0 \le k \le n-n'$

- $\ell_k \in dom(\gamma_{n'+k})$ and $\gamma_{n'+k}(\ell_k) = h_{k+1}$, for $0 \le k < n-n'$

- $h_0 = i'$ and $\ell_{n-n'} = i$

In the above setting with $n' \ge n$, we say that the sequence $\ell_0, h_0, \ell_1, h_1, \ldots, \ell_{n'-n}, h_{n'-n}$ is leading from
$i \in I_n$ up to $i' \in I_{n'}$. Likewise for the case where $n' \le n$.

For example, for the slice sequence $(sl_i)_{i=1}^{5}$ of Figure 2, to identify the guide belonging to the
guide-offset pair $(g_2,1)$ of slice $sl_2$, the pair is more precisely represented by the chunk $(g_2,1,3,2)$, for the
pair is associated with index $3 \in I_2$ of slice $sl_2$. Since for the cuts $\gamma_2 : I_2 \to I_3$ and $\gamma_3 : I_3 \to I_4$ we have
$\gamma_2(3) = 3$ and $\gamma_3(3) = 3$, we have $(g_2,1,3,2) \preceq (g_2,2,3,3) \preceq (g_2,3,3,4)$ via the sequence 3,3,3,3 connecting
$(g_2,1)$ and $(g_2,2)$, and 3,3,3,3 connecting $(g_2,2)$ and $(g_2,3)$. (Hence the combination of sequences
3,3,3,3,3,3 connects $(g_2,1)$ and $(g_2,3)$ directly.) As no jumps from a low index $\ell$ to a high index $h$
need to be taken, we also have $(g_2,1,3,2) \succeq (g_2,2,3,3) \succeq (g_2,3,3,4)$. Thus $(g_2,1,3,2) \equiv (g_2,2,3,3) \equiv
(g_2,3,3,4)$. In fact, $\{ (g_2,1,3,2), (g_2,2,3,3), (g_2,3,3,4) \}$ is an equivalence class for $\mathcal{X}$ corresponding
to the guide $g_2$ (cf. Lemma 10). Differently, we have $(g_2,1,3,2) \preceq (g_1,2,6,5)$ relating $g_2$ to the fourth
occurrence of $g_1$ via the sequence 3,3,3,3,3,5,5,5, for example. Since there is a jump here from $\ell_2 = 3$
to $h_2 = 5$, we do not have $(g_2,1,3,2) \succeq (g_1,2,6,5)$. This reflects that apparently the rewrite with this
occurrence of $g_1$ is on top of part of the rewrite using $g_2$ as guide.

Given a slice sequence $\sigma$, the ordering $\preceq$ on the chunks of $\sigma$ gives rise to a partial ordering on
the set $\mathcal{X}/{\equiv}$ of equivalence classes of chunks. As we will argue, the equivalence classes correspond
to guides and their ordering corresponds to the relative order in which the guides occur in a rewrite
sequence $\rho$ having the same yield as the slice sequence $\sigma$.

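Lemma 10. (a) The relation $\preceq$ on $\mathcal{X}$ is reflexive and transitive.

(b) The relation $\equiv$ on $\mathcal{X}$ such that $x \equiv y \iff x \preceq y \wedge y \preceq x$ is an equivalence relation.

(c) The ordering $\preceq$ on $\mathcal{X}/{\equiv}$ induced by $\preceq$ on $\mathcal{X}$ by $[x] \preceq [y] \iff \exists x' \in [x]\ \exists y' \in [y] : x' \preceq y'$, makes
$\mathcal{X}/{\equiv}$ a partial order. □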

The next lemma describes the form of the equivalence class holding a chunk $x = (g,q,i,n)$. Using
the cuts, equivalent chunks can be found backwards up to position $n-q+1$ and forward up to position
$n-q+\#g$. These chunks together, $(g,1,i_{n-q+1},n-q+1), \ldots, (g,q,i_n,n), \ldots, (g,\#g,i_{n-q+\#g},n-q+\#g)$,
span the guide $g$ that is to be applied in the rewrite sequence to be constructed.

Lemma 11. Let $\sigma = (sl_n)_{n=1}^{\#u}$ be a slice sequence for a string $u$. Let $\mathcal{X} = \{\, (g_{n,i}, q_{n,i}, i, n) \mid 1 \le n \le
\#u,\ i \in I_n \,\}$ be the set of chunks and choose $x \in \mathcal{X}$, say $x = (g,q,i,n)$. Put $p = n-q$. Then there exist
$j_1 \in I_{p+1}, \ldots, j_{\#g} \in I_{p+\#g}$ such that $[x] = \{\, (g,s,j_s,p+s) \mid 1 \le s \le \#g \,\}$. □

We are now in a position to prove the reverse of Theorem 8.

Theorem 12. Let $\sigma$ be a slice sequence for a string $u$. Then there exists a guided rewrite sequence $\rho$
for $u$ such that $\mathit{yield}(\rho) = \mathit{yield}(\sigma)$.

Proof. Suppose $\sigma = (s\ell_n)_{n=1}^{\#u}$, $s\ell_n = (g_{i,n},q_{i,n})_{i \in I_n}$ for $n = 1,\ldots,\#u$, and let $\mathcal{X} = \{ (g_{i,n},q_{i,n},i,n) \mid 1 \leq n \leq \#u,\ i \in I_n \}$
be the corresponding set of chunks. We proceed by induction on $\#\mathcal{X}$. Basis, $\#\mathcal{X} = 0$:
In this case every slice is empty and $\mathit{yield}(\sigma) = \mathit{yield}(s\ell_1) \cdots \mathit{yield}(s\ell_{\#u}) = u[1] \cdots u[\#u] = u$, and the
empty guided rewrite sequence for $u$ also has yield $u$.

Induction step, $\#\mathcal{X} > 0$: Clearly, $\mathcal{X}/{\equiv}$ is finite and therefore we can choose, by Lemma 10, $x \in \mathcal{X}$
such that $[x]$ is maximal in $\mathcal{X}/{\equiv}$ with respect to $\preceq$. By Lemma 11 we can assume
$[x] = \{ (g,s,i_s,p+s) \mid 1 \leq s \leq \#g \}$ for suitable $p$ and indexes $i_s \in I_{p+s}$, for $s = 1,\ldots,\#g$. Note, by maximality of $[x]$, the
indexes $i_s$ must be the maximum of $I_{p+s}$. In particular, $\mathit{yield}(\sigma)[p+s] = \mathit{yield}(s\ell_{p+s}) = g[s]$, for
$s = 1,\ldots,\#g$.

Now, consider the slice sequence $\sigma' = (s\ell'_n)_{n=1}^{\#u}$ where

\[
s\ell'_n =
\begin{cases}
s\ell_n & \text{for } n = 1,\ldots,p \text{ and } n = p+\#g+1,\ldots,\#u \\
(g_{i,n},q_{i,n})_{i \in I_n \setminus \{i_{n-p}\}} & \text{for } n = p+1,\ldots,p+\#g
\end{cases}
\]

So, the slice sequence $\sigma'$ is obtained from the slice sequence $\sigma$ by leaving out the guide-offset pairs
related to the particular occurrence of $g$.

Let $\mathcal{X}'$ be the set of chunks of $\sigma'$. Then $\#\mathcal{X}' < \#\mathcal{X}$. By induction hypothesis we can find a
guided rewrite sequence $\rho' = (g'_k,p'_k)_{k=1}^{r}$ for $u$ such that $\mathit{yield}(\rho') = \mathit{yield}(\sigma')$. Define the guided
rewrite sequence $\rho = (g_k,p_k)_{k=1}^{r+1}$ by $g_k = g'_k$, $p_k = p'_k$ for $k = 1,\ldots,r$ and $g_{r+1} = g$, $p_{r+1} = p$. We
have $0 \leq p \leq \#u-\#g$ and $u[p+1] \cdots u[p+\#g] \sim g$ since $s\ell_{p+1}, \ldots, s\ell_{p+\#g}$ are slices for $u[p+1], \ldots, u[p+\#g]$,
respectively. So, $\rho$ is a well-defined guided rewrite sequence for $u$.

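It holds that $\mathit{yield}(\rho') \Rightarrow_{g,p} \mathit{yield}(\rho)$ as $\rho$ extends $\rho'$ with the pair $(g,p)$. Therefore,

\[
\mathit{yield}(\rho)[n] =
\begin{cases}
\mathit{yield}(\rho')[n] & \text{for } n = 1,\ldots,p \text{ and } n = p+\#g+1,\ldots,\#u \\
g[n-p] & \text{for } n = p+1,\ldots,p+\#g
\end{cases}
\]

From this it follows, for any index $n$ with $1 \leq n \leq p$ or $p+\#g+1 \leq n \leq \#u$, that $\mathit{yield}(\rho)[n] = \mathit{yield}(\rho')[n] =
\mathit{yield}(\sigma')[n] = \mathit{yield}(\sigma)[n]$, and for any index $n$ with $p+1 \leq n \leq p+\#g$, that $\mathit{yield}(\rho)[n] = g[n-p] =
\mathit{yield}(\sigma)[n]$. As $\#\mathit{yield}(\rho) = \#\mathit{yield}(\sigma) = \#u$, we obtain $\mathit{yield}(\rho) = \mathit{yield}(\sigma)$, as was to be shown. □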
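
The case distinction for the yield used in the proof can be made concrete: a step $(g,p)$ overwrites positions $p+1,\ldots,p+\#g$ with $g$ and leaves all other positions alone. The following is a minimal Python sketch of applying a guided rewrite sequence under that reading. The function names and the adjustment relation with classes {a,b}, {c,d}, {e,f} are assumptions made for illustration and are not taken from the paper's examples.

```python
def apply_step(w, g, p, adjusts):
    """Apply guide g at position p: overwrite w[p+1..p+#g] (1-indexed) with g.

    Requires 0 <= p <= #w - #g and symbol-by-symbol adjustment equivalence
    between the rewritten substring and the guide.
    """
    if not (0 <= p <= len(w) - len(g)):
        raise ValueError("guide does not fit at this position")
    window = w[p:p + len(g)]
    if not all(adjusts(a, b) for a, b in zip(window, g)):
        raise ValueError("substring is not adjustment-equivalent to the guide")
    return w[:p] + g + w[p + len(g):]


def yield_of(u, rho, adjusts):
    """Yield of the rewrite sequence rho = [(g_1, p_1), ..., (g_r, p_r)] for u."""
    w = u
    for g, p in rho:
        w = apply_step(w, g, p, adjusts)
    return w


# Hypothetical adjustment relation: two symbols are related iff they share a class.
CLASSES = ["ab", "cd", "ef"]
toy_adjusts = lambda x, y: any(x in c and y in c for c in CLASSES)
```

Since each step only replaces symbols by adjustment-equivalent ones, the string length never changes, which is the property the automaton construction of Section 6 relies on.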



For the slice sequence $(s\ell_i)_{i=1}^{5}$ of Figure 2 we have the following equivalence classes of chunks:

G 3 = {(#3,1,1,3)} G 2 = {($2,1,3,2), (£2,2,3,3), (£2,3,3,4)} 

G\ = {(£!, 1,2, 1), (£ b 2,2,2)} Gj = {(£1,1,5,4), (£1, 2,5,5)} 
G\ = {(£!, 1, 4,1), On, 2,4,2)} G\ = {(£i,l,6,4), (£i,2,6,5)} 

Moreover, G 3 ^ G\ =^ G 2 , G 2 ^ G^ and G 2 =^ G\ 4 G\. A possible linearization is G 3 =^ G\ 4 G 2 =4 
Gj ^ G j =^ G\. This corresponds to the rewrite sequence 

ebcfa =^ 3;2 => gl . fbdfa => g2t ifacea => gl3 facfb =+ glt3 facfb => gl fbcfb 

Note that the yield $fbcfb$ of this rewrite sequence is the same as the yield of the sequence (1) of Example 7.
However, here the second rewrite with $g_1$ of (1) has been moved to the end. This does not affect the end
result as the particular rewrites do not overlap.



6 Guided rewriting preserves regularity 

Given a language $L$ and a set of guides $G$, the language $L_G$ is given as the set $\{ v \in \Sigma^* \mid \exists u \in L : u \Rightarrow_G^* v \}$.
One of the main results of this paper, Theorem 3 formulated in Section 4, states that if $L$ is regular
then $L_G$ is regular too. We will prove the theorem by constructing a non-deterministic finite automaton
accepting $L_G$ from a deterministic finite automaton accepting $L$. The proof exploits the correspondence
of rewrite sequences and slice sequences, Theorem 8 and Theorem 12. First we need an auxiliary result
to assure finiteness of the automaton for $L_G$.
Lemma 13. Let $G$ be a finite set of guides. Let $Z = \{ s\ell \mid s\ell \text{ repetition-free slice for } a \text{ and } G,\ a \in \Sigma \}$.
Then $Z$ is finite. Moreover, for every string $u$ and every rewrite sequence $\rho$ for $u$, there exists a slice
sequence $\sigma$ for $u$ consisting of slices from $Z$ only such that $\mathit{yield}(\sigma) = \mathit{yield}(\rho)$.

Proof sketch. Finiteness of Z is immediate: there are finitely many guide-offset pairs (g, q), hence finitely 
many repetition-free finite sequences of them. Thus, there are only finitely many repetition-free slices. 

Now, let $\rho$ be a rewrite sequence for a string $u$. By Theorem 8 we can choose a slice sequence $\sigma'$ such
that $\mathit{yield}(\sigma') = \mathit{yield}(\rho)$. Suppose $\sigma' = (s\ell_n)_{n=1}^{\#u}$ and $s\ell_n = (g_{i,n},q_{i,n})_{i \in I_n}$ for $n = 1,\ldots,\#u$. By Lemma 11
it follows that given a repeated guide-offset pair $(g,q)$, say $(g,q) = (g_{i,n},q_{i,n})$ and $(g,q) = (g_{j,n},q_{j,n})$ for
indexes $i < j$ in $I_n$, we can delete the complete equivalence class of $(g_{i,n},q_{i,n},i,n)$ from slices $s\ell_{n-q+1}$
to $s\ell_{n-q+\#g}$, while retaining a slice sequence. In fact, we are removing the 'lower' occurrence of the
guide $g$. Moreover, the resulting slice sequence has the same yield, as for all slices the topmost guide-offset
pair remains untouched. The existence of a repetition-free slice sequence $\sigma$ such that $\mathit{yield}(\sigma) =
\mathit{yield}(\sigma')$, hence $\mathit{yield}(\sigma) = \mathit{yield}(\rho)$, then follows by induction on the number of repetitions. □
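
The finiteness argument can be turned into an explicit, if crude, upper bound on the size of $Z$. The sketch below counts all repetition-free sequences of guide-offset pairs and multiplies by the alphabet size. It deliberately ignores any further well-formedness conditions on slices, so the actual number of slices may be much smaller; the function name and parameters are hypothetical.

```python
from math import factorial


def repetition_free_bound(guides, alphabet_size):
    """Upper bound on the number of repetition-free slices.

    m counts the guide-offset pairs (g, q) with 1 <= q <= #g. A repetition-free
    sequence of length k picks k distinct pairs in order, giving m!/(m-k)! choices.
    Each slice is additionally a slice for one letter of the alphabet.
    """
    m = sum(len(g) for g in set(guides))
    sequences = sum(factorial(m) // factorial(m - k) for k in range(m + 1))
    return alphabet_size * sequences
```

For two guides of lengths 2 and 1 there are three guide-offset pairs, hence 1 + 3 + 6 + 6 = 16 repetition-free sequences per letter.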

As a corollary we obtain that every rewrite sequence has a repetition-free equivalent, an intuitive result
that nevertheless requires some technicalities to obtain directly.

We are now prepared to prove that guided rewriting preserves regularity. 

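Proof of Theorem 3. Without loss of generality $\varepsilon \notin L$. Let $M = (\Sigma, Q, \to, q_0, F)$ be a DFA accepting $L$.
We define the NFA $M' = (\Sigma, Q', \to', q_0, F')$ as follows: Let $q_F$ be a fresh state. Put $Q' = Q \cup (Q \times Z) \cup
\{q_F\}$ with $Z$ as given by Lemma 13, $F' = \{q_F\}$ and

\[
\begin{array}{ll}
q_0 \xrightarrow{\varepsilon}{}' q_0 \times \zeta & \text{if } \zeta \text{ is a start slice} \\
q \times \zeta \xrightarrow{b}{}' q' \times \zeta' & \text{if } q \xrightarrow{a} q',\ a \sim \zeta,\ \mathit{yield}(\zeta) = b,\ \zeta \leadsto \zeta' \\
q \times \zeta \xrightarrow{b}{}' q_F & \text{if } \exists q' : q \xrightarrow{a} q' \in F,\ a \sim \zeta,\ \mathit{yield}(\zeta) = b,\ \zeta \text{ is an end slice}
\end{array}
\]

Note, by Lemma 13, $Q'$ is a finite set of states. The automaton $M'$ has only one final state, viz. $q_F$.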

Suppose $v \in L_G$. Then there exist $u = a_1 \cdots a_s \in L$, a rewrite sequence $\rho = (g_k,p_k)_{k=1}^{r}$ and strings
$u_0,u_1,\ldots,u_r$ such that $u = u_0$, $u_{k-1} \Rightarrow_{g_k,p_k} u_k$ for $k = 1,\ldots,r$, and $v = u_r$. Let, by Theorem 8 and
Lemma 13, $\sigma$ be a slice sequence for $u$ of repetition-free slices with $\mathit{yield}(\sigma) = \mathit{yield}(\rho)$. Say $\sigma =
(s\ell_n)_{n=1}^{\#u}$ and $s\ell_n = (g_{i,n},q_{i,n})_{i \in I_n}$ for $n = 1,\ldots,\#u$. Let $q_0 \xrightarrow{a_1} q_1 \cdots \xrightarrow{a_s} q_s \in F$ be an accepting computation
of $M$ for $u$. Then $q_0 \xrightarrow{\varepsilon}{}' q_0 \times s\ell_1 \xrightarrow{b_1}{}' \cdots\ q_{s-1} \times s\ell_s \xrightarrow{b_s}{}' q_F$, with $b_n = \mathit{yield}(s\ell_n)$, is an accepting computation of $M'$. Since we
have $b_1 \cdots b_s = \mathit{yield}(s\ell_1) \cdots \mathit{yield}(s\ell_s) = \mathit{yield}(\sigma) = v$, it follows that $v \in \mathcal{L}(M')$. So, $L_G \subseteq \mathcal{L}(M')$.

Let $v = b_1 \cdots b_s$ be a string in $\mathcal{L}(M')$. Given the definition of the transition relation of $M'$, we can
find states $q_0,q_1,\ldots,q_{s-1}$ and repetition-free slices $s\ell_1,\ldots,s\ell_s$ such that $s\ell_n \leadsto s\ell_{n+1}$ for $n = 1,\ldots,s-1$, and
a computation $q_0 \xrightarrow{\varepsilon}{}' q_0 \times s\ell_1 \xrightarrow{b_1}{}' \cdots\ q_{s-1} \times s\ell_s \xrightarrow{b_s}{}' q_F$. Thus, there exist a final state $q_s$ and a computation
$q_0 \xrightarrow{a_1} q_1 \cdots q_{s-1} \xrightarrow{a_s} q_s \in F$ such that $a_n \sim s\ell_n$ for $n = 1,\ldots,s$, i.e. $s\ell_n$ is a slice for $a_n$. Put $u = a_1 \cdots a_s$.
Then $u \in L$, $\sigma = (s\ell_n)_{n=1}^{s}$ is a slice sequence for $u$ and $\mathit{yield}(\sigma) = v$. By Theorem 12 we can find a rewrite
sequence $\rho$ for $u$ such that $\mathit{yield}(\rho) = \mathit{yield}(\sigma) = v$. It follows that $u \Rightarrow_G^* v$ and $v \in L_G$. Thus, $\mathcal{L}(M') \subseteq
L_G$. We conclude that $L_G = \mathcal{L}(M')$ and regularity of $L_G$ follows. □

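Since $L \subseteq L_G$ the automaton $M'$ should accept any word $a_1 \cdots a_s \in L$, $s > 0$. This can be verified as
follows. Let $\zeta_i$ be the empty slice for $a_i$, $i = 1,\ldots,s$. Then $a_i \sim \zeta_i$, i.e. $a_i = \mathit{yield}(\zeta_i)$, which holds by
definition. Moreover, $\zeta_1$ is a start slice, $\zeta_i \leadsto \zeta_{i+1}$ for $i = 1,\ldots,s-1$, and $\zeta_s$ is an end slice. It follows that
we can turn an accepting computation of $M$, say $q_0 \xrightarrow{a_1} q_1 \xrightarrow{a_2} \cdots \xrightarrow{a_s} q_s \in F$, into an accepting computation
of $M'$: $q_0 \xrightarrow{\varepsilon}{}' q_0 \times \zeta_1 \xrightarrow{a_1}{}' q_1 \times \zeta_2 \xrightarrow{a_2}{}' \cdots\ q_{s-1} \times \zeta_s \xrightarrow{a_s}{}' q_F$.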
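
The construction of $M'$ and the sanity check above can be simulated directly. The following is a minimal Python sketch under several assumptions: a slice is encoded as a pair `(letter, pairs)` of the letter it is a slice for and a tuple of guide-offset pairs with the topmost pair last, and the notions of start slice, end slice and the relation $\leadsto$ are passed in as predicates `is_start`, `is_end` and `follows`, since their definitions are not repeated here. All names are hypothetical.

```python
def slice_yield(zeta):
    """Yield of a slice: its own letter if empty, else the letter g[q] of the topmost pair."""
    letter, pairs = zeta
    if not pairs:
        return letter
    g, q = pairs[-1]
    return g[q - 1]  # offsets are 1-indexed


def accepts(word, delta, q0, finals, Z, is_start, is_end, follows):
    """Simulate M' on `word`, given the DFA M as (delta, q0, finals).

    `delta` maps (state, letter) to the next state. States of M' of the form
    q x zeta are modelled as pairs (q, zeta); the fresh final state q_F is implicit.
    """
    if not word:
        return False  # the construction assumes the empty string is not in L
    # Epsilon move from q0 to q0 x zeta for every start slice zeta.
    current = {(q0, z) for z in Z if is_start(z)}
    for pos, b in enumerate(word):
        last = pos == len(word) - 1
        nxt, reached_final = set(), False
        for q, z in current:
            a = z[0]  # z is a slice for the letter a
            if slice_yield(z) != b or (q, a) not in delta:
                continue
            q_next = delta[(q, a)]
            if last:
                # Move to q_F: needs an end slice and a final state of M.
                reached_final = reached_final or (is_end(z) and q_next in finals)
            else:
                nxt.update((q_next, z2) for z2 in Z if follows(z, z2))
        if last:
            return reached_final
        current = nxt
    return False


# DFA for the one-word language {ab}, with only the empty slices available.
delta_ab = {(0, "a"): 1, (1, "b"): 2}
empty_slices = {("a", ()), ("b", ())}
always = lambda *args: True
```

With only empty slices and trivially true predicates, the simulated $M'$ accepts exactly the words of $L$, which mirrors the verification of $L \subseteq L_G$ above. Supplying the actual slice predicates would be needed to obtain $L_G$ itself.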
We now return to a proof of Theorem 1, formulated in Section 3, for which we want to apply Theorem 3.
For the latter theorem to apply we need a preparatory transformation. The point is, in the setting of 
guided insertion/deletion, strings are allowed to grow or shrink while guided insertions and deletions are 
being applied, whereas in the setting of guided rewriting the strings do not change length. 

The key idea of the transformation is that every group of 0's is compressed to a single symbol. Let
a language $L$ over $\Sigma$ and a number $k$ be given by Theorem 1. So, $L$ does not contain strings with $k$ or
more consecutive 0s. We introduce $k$ fresh symbols $0_0,0_1,\ldots,0_{k-1}$. Put $\Sigma_0 = \{0_0,0_1,\ldots,0_{k-1}\}$. For any string $u$
over $\Sigma$ not containing the substring $0^k$, i.e. not containing $k$ or more consecutive zeros, we define the string $\bar{u}$ over the
alphabet $\bar{\Sigma} = (\Sigma \setminus \{0\}) \cup \{0_0,0_1,\ldots,0_{k-1}\}$ that is obtained from $u$ by replacing every maximal pattern $0^i$
by the single symbol $0_i$. Note, between two consecutive non-zero letters $ab$ the symbol $0_0$ is interspersed.
For instance, for $k \geq 3$, $\overline{10023} = 1\,0_2\,2\,0_0\,3$. Also note that the compression scheme constitutes a 1-1
correspondence of $\Sigma^* \cap \{ w \mid w \text{ has no substring } 0^k \}$ and $(\Theta \cdot \Sigma_0)^* \cdot \Theta$.
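
The compression scheme can be sketched in a few lines of Python. The sketch follows the example $\overline{10023} = 1\,0_2\,2\,0_0\,3$ and therefore assumes that the input is non-empty and starts and ends with a non-zero letter; how leading or trailing zeros are to be treated is not settled here. The symbol $0_i$ is encoded as the tuple `("0", i)`, and `compress` and `expand` are hypothetical names.

```python
def compress(u, k):
    """Replace every maximal run 0^i between non-zero letters by the single token ("0", i).

    Assumption of this sketch: u is non-empty and neither starts nor ends with "0".
    Raises ValueError if u contains k or more consecutive zeros.
    """
    if not u or u[0] == "0" or u[-1] == "0":
        raise ValueError("sketch assumes a non-empty string with non-zero first and last letter")
    out, run = [u[0]], 0
    for ch in u[1:]:
        if ch == "0":
            run += 1
            if run >= k:
                raise ValueError("string contains k or more consecutive zeros")
        else:
            out.append(("0", run))  # run may be 0, giving the token for 0_0
            out.append(ch)
            run = 0
    return out


def expand(tokens):
    """Inverse direction: map the token ("0", i) back to i zeros and keep other letters."""
    return "".join("0" * t[1] if isinstance(t, tuple) else t for t in tokens)
```

Expanding a compressed string gives back the original, which is the 1-1 correspondence on the strings covered by the sketch. The function `expand` plays the role of a homomorphism sending $0_i$ to $0^i$.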

Next, we show that the above operation of compressing groups of 0s preserves regularity using basic 
closure properties of the class of regular languages, cf. Q Section 3] . 



Lemma 14. Let $L$ be a language without strings containing $0^k$ and let $\bar{L} = \{ \bar{u} \mid u \in L \}$. Then $L$ is regular
if and only if $\bar{L}$ is regular.

Proof. The language $L$ is the homomorphic image of $\bar{L}$ for $h : \bar{\Sigma}^* \to \Sigma^*$ with $h(0_i) = 0^i$ and $h(a) = a$
otherwise. So, if $\bar{L}$ is regular, so is $L$. Reversely, $\bar{L} = (\Theta \cdot \Sigma_0)^* \cdot \Theta \cap h^{-1}(L)$. Hence, if $L$ is regular, so
is $\bar{L}$. □



With the above lemma in place we can give a proof of the preservation of regularity by guided inser- 
tion/deletion. 



Proof of Theorem 1. Let $k$ be as given by the statement of the theorem. Obtain $\bar{L}$ by applying the
compression of strings $0^i$, for $i < k$, changing from the alphabet $\Sigma$ to $\bar{\Sigma}$, as introduced above. By Lemma 14
we then have that $\bar{L}$ is regular. Let $\bar{G}$ be obtained from $G$, again by compression of strings $0^i$, for $i < k$.
Then $\bar{G}$ is a finite set of guides with respect to $\bar{\Sigma}$. Now let the adjustment relation $\sim$ be the equivalence
relation on $\bar{\Sigma}$ generated by $0_i \sim 0_j$, $0 \leq i,j < k$. By Theorem 3 we obtain that $\bar{L}_{\bar{G}}$ is regular.

Next we note that if $u \Rightarrow_{i/d} v$ with respect to $\Sigma$, then $\bar{u} \Rightarrow \bar{v}$ with respect to $\bar{\Sigma}$. Vice versa, if $\bar{u} \Rightarrow \bar{v}$ and
there exist (unique) $u$ and $v$ such that $u,v$ map to $\bar{u},\bar{v}$ under compression, then $u \Rightarrow_{i/d} v$. It follows that
$\bar{L}_{\bar{G}}$ and $\overline{L_{i/d}}$ coincide. Finally, by another application of Lemma 14 we conclude that $L_{i/d}$ is regular. □



7 Related work and concluding remarks 

In this paper we have discussed abstract concepts of guided rewriting: a more flexible notion focusing on 
insertions and deletions of a dummy symbol, another more strict notion based on an equivalence relation. 
Given a language $L$ we considered the extended languages $L_{i/d}$ and $L_G$ comprising the closure of $L$ for
the two types of guided rewriting with guides from a finite set G. In particular, as our main result we 
proved that these closures preserve regularity. For doing so we investigated the local effect of guided 
rewriting on two consecutive string positions, leading to a novel notion of a slice sequence. Finally, 
the theorem for adjustment-based rewriting was proved by an automaton construction exploiting a slice 
sequence characterization of guided rewriting. Via a compression scheme for strings of dummy symbols, 
the theorem for guided insertion/deletion followed. 
Preservation of regularity by closing a language with respect to a given notion of rewriting arises as a
natural question. In Section 5 we observed that by closing the regular language $\mathcal{L}((ab)^*)$ under rewriting
with respect to the single rewrite rule $ba \to ab$ the resulting language is not regular. So, by arbitrary
string rewriting regularity is not necessarily preserved. A couple of specific rewrite formats have been 
proposed in the literature. In O it was proved that regularity is preserved by deleting string rewriting, 
where a string rewriting system is called deleting if there exists a partial ordering on its alphabet such 
that each letter in the right-hand side of a rule is less than some letter in the corresponding left-hand 
side. In [9] it was proved that regularity is preserved by so-called period expanding or period reducing
string rewriting. When translated to the setting of [15], as also touched upon in Section 3, our present
notion of guided insertions and deletions allows for simultaneous insertion and deletion of the dummy
symbol, a phenomenon also supported by biological findings. Remarkably, the more liberal guided
insertion/deletion approach preserves regularity, whereas in the more restricted mechanism of [15], not
mixing insertions and deletions per rewrite step, regularity is not preserved. As another striking difference
with the mechanism of [15], for that format it was shown that strings $u$, $v$ of length $n$ exist satisfying
$u \Rightarrow^* v$, but the length of the reduction is at least exponential in $n$. In our present format this is not the
case: we expect that our slice characterization of guided rewriting serves to prove that if $u \Rightarrow^* v$ then
there is always a corresponding reduction of length linear in the length of $u$ and $v$. Details have not been
worked out yet.
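
The observation that closing $\mathcal{L}((ab)^*)$ under $ba \to ab$ destroys regularity can be checked on small instances. The rule preserves the numbers of $a$'s and $b$'s, so intersecting the closure with $a^*b^*$ leaves only words $a^n b^n$, and a regular language intersected with $a^*b^*$ would have to be regular. The following Python sketch computes the closure of a single word by exhaustive search; the function name `closure` is hypothetical, and the search is only practical for short words.

```python
def closure(word, lhs="ba", rhs="ab"):
    """All words reachable from `word` by repeatedly replacing an occurrence of lhs by rhs."""
    seen, todo = {word}, [word]
    while todo:
        w = todo.pop()
        for i in range(len(w) - len(lhs) + 1):
            if w[i:i + len(lhs)] == lhs:
                v = w[:i] + rhs + w[i + len(lhs):]
                if v not in seen:
                    seen.add(v)
                    todo.append(v)
    return seen


# Words of the closure that lie in a*b* are exactly those without a factor "ba".
sorted_words = [w for w in closure("ababab") if "ba" not in w]
```

For the word $(ab)^3$ the only closure element in $a^*b^*$ is $a^3b^3$, in line with the argument sketched above.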

The notion of splicing, inspired by DNA recombination, has been proposed by Head in Q . A so-called
splicing rule is a tuple $r = (u_1,v_1;u_2,v_2)$. Given two words $w_1 = x_1u_1v_1y_1$ and $w_2 = x_2u_2v_2y_2$
the rule $r$ produces the word $w = x_1u_1v_2y_2$. So, the word $w_1$ is split in between $u_1$ and $v_1$, the word $w_2$
in between $u_2$ and $v_2$, and the resulting subwords $x_1u_1$ and $v_2y_2$ are recombined into the word $w$. For
splicing a closure result, reminiscent of the one for guided rewriting considered in this paper, has been
established. Cast in our terminology, if $L$ is a regular language and $S$ is a finite set of splicing rules,
then $L_S$ is regular too, cf. lHJQj]]. Here, $L_S$ is the least language containing $L$ and closed under the splicing
rules of $S$.
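
The effect of a single splicing rule on two words, as described above, can be sketched as follows. The function name `splice` is hypothetical; it returns all words $x_1u_1v_2y_2$ obtainable from one application of the rule, since the factors $u_1v_1$ and $u_2v_2$ may occur at several positions. Computing the closure $L_S$ is not attempted.

```python
def splice(rule, w1, w2):
    """Apply the splicing rule (u1, v1; u2, v2) to w1 and w2 in all possible ways.

    w1 is cut between an occurrence of u1 and v1, w2 between an occurrence of
    u2 and v2; the left part of w1 is joined with the right part of w2.
    """
    u1, v1, u2, v2 = rule
    results = set()
    for i in range(len(w1) + 1):
        if w1[:i].endswith(u1) and w1[i:].startswith(v1):
            for j in range(len(w2) + 1):
                if w2[:j].endswith(u2) and w2[j:].startswith(v2):
                    results.add(w1[:i] + w2[j:])
    return results
```

If either word lacks the required factor, the result is the empty set.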

The computational power of a variant of insertion-deletion systems was studied in [14]. There deletion
means that a string $uav$ is replaced by $uv$ for a predefined finite set of triples $u, a, v$, while by insertion
a string uv is replaced by uav for another predefined finite set of triples u, a, v. This notion of insertion- 
deletion is quite different from ours, and seems less related to biological RNA editing. In the same 
vein are the guided insertion/deletion systems of 0. There a hierarchy of classes of insertion/deletion 
systems and related closure properties are studied. Additionally, a non-mixing insertion/deletion system 
that models part of the RNA-editing for kinetoplastids is given. A rather different application of term 
rewriting in the setting of RNA is reported in where the rewrite engine of Maude is exploited to 
predict the occurrence of specific patterns in the spatial formation of RNA, with competitive precision 
compared to techniques that are more frequently used in bioinformatics. 

Possible future work includes investigation of preservation of context-freedom and of lifting the
bound on the number of consecutive 0's in Theorem 1. More specifically, for a context-free language $L$,
does it hold, for a finite set of guides $G$, that $L_G$ is context-free too? Considering the set of guides, a
generalization to regular sets $G$ is worthwhile studying. Note that the counter-example given in Section 4
involves a non-regular set of guides. So, if $L$ is regular and $G$ is regular, do we have that $L_G$ is regular?
Similarly for $L$ context-free. We also plan to consider guided rewriting based on other types of adjustment
relations. In particular, rather than comparing strings symbol-by-symbol, one can consider two strings
compatible if they map to the same string for a chosen string homomorphism. A prime example would
be the erasing of the dummy in the context of Section 3, for which we conjecture a variant of Theorem 3
to hold.